Abstract. Neighbourhood-based Collaborative Filtering (CF) has been applied in
industry for several decades because of its easy implementation and high
recommendation accuracy. As the core of neighbourhood-based CF, the task of
dynamically maintaining users’ similarity lists is challenged by the cold-start
problem and the scalability problem. Recently, several methods have been
proposed to solve these two problems. However, these methods apply an $O(n^2)$
algorithm to compute the similarity list in a special case, where the new
users, with enough recommendation data, have the same rating list. To address
the large computational cost caused by this special case, we design a faster
algorithm, the TwinSearch Algorithm, which avoids repeatedly computing and
sorting the similarity list for the new users and thereby saves computational
resources. Both theoretical and experimental results show that the TwinSearch
Algorithm achieves better running time than the traditional method.

1 Introduction


In recommender systems, Collaborative Filtering (CF) is a well-known technology
with three main popular families of algorithms, i.e., neighbourhood-based
methods, association-rule-based prediction, and matrix factorisation. Among
these algorithms, neighbourhood-based methods are the most widely used in
industry because of their easy implementation and high prediction accuracy.

The core of neighbourhood-based CF methods is the computation of a sorted
similarity list for every user. Dynamically maintaining this list is important
in a neighbourhood-based recommender system, because the creation of new users
and the rating updates of existing users frequently force the similarity list
to be updated. Accordingly, there are two main research problems in recommender
systems: the cold-start problem and the scalability problem. Recent research
[6-9] on the cold-start problem focuses on improving the prediction accuracy
with limited rating information, while some of the solutions [10-13] to the
scalability problem decrease the computational cost by linking the new
similarity list with the old one.
Different from the two classic problems, we notice a special case in which the
methods listed above do not work well. In this case, the new users have enough
rating data to build a reliable similarity list, and they all have exactly the
same rating list. Methods that aim to solve the cold-start problem or the
scalability problem treat this special case as a normal input of the
recommender system and apply an $O(n^2)$ algorithm to compute the new users’
similarity lists. Since the number of users in a recommender system, $n$, is
usually very large, the computational cost of this approach is high.
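To make the cost of the traditional approach concrete, the following is a minimal sketch of building one new user's sorted similarity list from scratch. It assumes cosine similarity (the metric used later in the experiments) and dense rating vectors in which 0 means "unrated"; the function names and the toy ratings are illustrative, not taken from the paper. Each call compares the new user with all $n$ existing users, and two identical new users trigger exactly the same work twice.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length rating vectors (0 = unrated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_list(new_ratings, ratings):
    """Return [(user_index, similarity)] sorted by descending similarity.

    One similarity computation per existing user, followed by a sort.
    """
    sims = [(i, cosine_similarity(new_ratings, r)) for i, r in enumerate(ratings)]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims


# Toy data (hypothetical): 3 existing users, 4 items.
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
]

# Two new users with identical ratings: the traditional method repeats
# the full computation and obtains the same list both times.
twin_a = [5, 3, 0, 1]
twin_b = [5, 3, 0, 1]
list_a = build_similarity_list(twin_a, ratings)
list_b = build_similarity_list(twin_b, ratings)
```

The second call produces nothing new, which is the redundancy the special case exposes.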

To address the large computational cost caused by this special case, we design
a faster algorithm, the TwinSearch Algorithm, which avoids repeatedly computing
and sorting the similarity list for the new users and thereby saves
computational resources. Moreover, we compare the running time of the
TwinSearch algorithm and the traditional similarity computation method on two
real-world datasets for both user-based and item-based CF. Both theoretical and
experimental results show that the TwinSearch algorithm achieves better running
time than the traditional method. To the best of our knowledge, we are the
first to consider this special case in recommender systems.

The rest of this paper is organised as follows. First, in Section 2, we discuss
the existing techniques for dynamically maintaining the similarity list in
recommender systems. Afterwards, in Section 3, we present a faster algorithm to
build new users’ similarity lists in neighbourhood-based CF recommender
systems, and analyse the time complexity of our algorithm theoretically. Next,
in Section 4, we provide an experimental analysis of the running time of our
algorithm. Finally, we conclude with a summary in Section 5.


2 Related Work 

In this section, we briefly summarise some of the current work on the
cold-start problem and the scalability problem.

Some of the methods for the cold-start problem focus on improving prediction
accuracy despite the lack of sufficient rating data for new users. These
methods gain better prediction performance by applying different strategies.
For example, Bobadilla et al. presented a new similarity measure, optimised
with neural learning, which shows much better results than current metrics such
as the cosine similarity measure. Liu et al. showed an interesting phenomenon:
linking a cold-start item to inactive users gives the new item more chance to
appear in other users’ recommendation lists. Adamopoulos et al. applied a
probabilistic method to select the $k$ neighbours from the entire candidate
list, rather than the $k$ nearest candidates, to avoid the low prediction
accuracy caused by the lack of rating data. Lika et al. proposed an approach
which incorporates classification methods in a pure CF system, where the use of
demographic data helps to identify other users with similar behaviour.

Some of the solutions to the scalability problem are based on incremental
updates of user-to-user and item-to-item similarity. These methods achieve
faster, high-quality recommendation compared with traditional CF. Papagelis et
al. [10] proposed an incremental method which quickly updates a user’s
similarity list when the user adds or rates new items in the recommender
system. Liu et al. [11] presented a temporal relevance measure for ratings at
different time steps and developed online evolutionary collaborative filtering
algorithms by introducing this measure into $k$NN algorithms and incrementally
computing neighbourhood similarities, which achieves both better time and
better space complexity. Inspired by [10], Yang et al. [12] extended the
user-based incremental similarity update method to a corresponding item-based
method. Huang et al. [13] proposed a practical item-based CF algorithm for big
data environments, with desirable characteristics such as robustness to the
implicit feedback problem, scalable incremental updates and real-time pruning.

Unfortunately, the current methods for the cold-start problem and the
scalability problem do not work well in a special case: the new users, with
enough recommendation data, have the same rating list. (The $k$ Nearest
Neighbour ($k$NN) attack [14] is an example of this special case: it creates
$k$ identical fake users, each with at least 8 rated items, in the recommender
system.) The reasons are that the solutions to the cold-start problem only
apply to new users for whom sufficient information has not yet been gathered,
and the methods concentrating on the scalability problem only apply to existing
users who already have a similarity list. Consequently, when facing the special
case, the above methods have to apply the traditional similarity computation
method, which has $O(n^2)$ time complexity. Since the number of users in a
recommender system, $n$, is usually very large, the computational cost is high.
Therefore, a faster algorithm is needed to build the new users’ similarity
lists in this special case.


3 The TwinSearch Algorithm 

3.1 Algorithm Design 

In this section, we define users who have the same rating list as twin users.
To address the large computational cost of the special case, in which the new
users, with enough recommendation data, have the same rating list, we aim to
avoid repeatedly computing and sorting the similarity list for the new users.
Since the new users are identical, our strategy is to search for a twin user in
the system and then copy the twin user’s similarity list to the new user
directly.

According to the properties of similarity in recommender systems, if two users
are twin users, i.e., $u_a = u_b$, then their similarities to an arbitrary user
$u_i$ are equal, i.e., $sim_{ai} = sim_{bi}$. By the definition of twin users,
their ratings on any item $i$ are also equal, i.e., $r_{ai} = r_{bi}$.
Therefore, we have the following relationships:

$$
u_a = u_b \Rightarrow sim_{ai} = sim_{bi} \tag{1}
$$

$$
u_a = u_b \Leftrightarrow r_{ai} = r_{bi} \tag{2}
$$
Relationship (1) helps us find the potential twin users in the system, and
Relationship (2) helps us find the exact twin user among the potential ones. We
now design the TwinSearch Algorithm, which finds the twin user and copies its
similarity list to the new user using Relationships (1) and (2).
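The reason both relationships are needed can be illustrated with a small sketch. Equal similarity to a probe user is necessary for two users to be twins, but it is not sufficient: under cosine similarity (assumed here, as in the experiments), any user whose rating vector is a scaled copy of another has the same similarity to every probe without being a twin. The vectors and names below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


u_a = [2, 4, 0]
u_b = [2, 4, 0]    # twin of u_a: identical ratings
u_c = [1, 2, 0]    # same direction as u_a, but different ratings
probe = [5, 1, 3]  # an arbitrary third user u_i

sim_a = cosine(u_a, probe)
sim_b = cosine(u_b, probe)
sim_c = cosine(u_c, probe)

# Equal similarity to the probe: holds for the twin AND for the non-twin u_c.
passes_similarity_filter = [math.isclose(sim_a, s) for s in (sim_b, sim_c)]
# Equal ratings on every item: holds only for the real twin.
passes_rating_check = [u_a == u for u in (u_b, u_c)]
```

Here `u_c` survives the similarity filter but fails the rating comparison, so a final item-by-item check is what confirms a twin.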


Algorithm 1 TwinSearch Algorithm.

Input:
  A user-item rating set, $\mathcal{R}$, with $n$ users and $m$ items; a
  user-user sorted similarity matrix, $S$; a new user, $u_0$, with several
  ratings on different items; a constant, $c \in \mathbb{Z}^+$.

Output:
  The new user $u_0$’s similarity list.

 1: Select $c$ random users, $u_i$, $i \in [1, c]$;
 2: for $i = 1$ to $c$ do
 3:   compute the similarity between users $u_0$ and $u_i$, $sim_{0i}$;
 4:   search $u_i$’s similarity list $S_i$ for $Set_i = \{u_x \mid sim_{ix} = sim_{0i}\}$;
 5:   if $sim_{0i} = 1$ then
 6:     add $u_i$ to $Set_i$;
 7:   end if
 8: end for
 9: Compute the intersection $Set_0$ of the $c$ sets $Set_i$, $Set_0 = \bigcap_{i=1}^{c} Set_i$;
10: for $i = 1$ to $|Set_0|$ do
11:   if $r_{ij} = r_{0j}$ for all $j \in [1, m]$ then
12:     copy the similarity list of $u_i \in Set_0$ to $u_0$;
13:     break;
14:   end if
15: end for
16: return The new user $u_0$’s similarity list.


In line 4, we search for the potential twin users using Relationship (1). In
line 9, we narrow down the final potential twin user set $Set_0$ by
intersecting the $c$ larger potential twin user sets $Set_i$. The for loop in
lines 10-15 finds the twin user in the potential twin user set using
Relationship (2). Our algorithm works for both user-based and item-based CF; in
this section, we present the TwinSearch algorithm from the perspective of
user-based methods, and it can be applied to item-based methods in a
straightforward way.
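The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under several assumptions that are not specified in the paper: ratings are dense lists, similarity is cosine, each stored similarity list is kept as two parallel arrays sorted in ascending order so that `bisect` can be used for the binary search, a small tolerance `EPS` guards floating-point comparisons, and the function returns `None` when no twin exists (the caller would then fall back to the traditional computation). All names are hypothetical.

```python
import bisect
import math
import random

EPS = 1e-12  # tolerance for comparing floating-point similarities (assumption)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_sorted_lists(ratings):
    """Precompute, per user, (similarities ascending, matching user ids)."""
    lists = []
    for i, r_i in enumerate(ratings):
        pairs = sorted(
            (cosine(r_i, r_x), x) for x, r_x in enumerate(ratings) if x != i
        )
        lists.append(([s for s, _ in pairs], [x for _, x in pairs]))
    return lists


def twin_search(ratings, sim_lists, new_ratings, c, rng=random):
    """Return a copy of a twin user's similarity list, or None if no twin."""
    probes = rng.sample(range(len(ratings)), c)            # line 1 (c >= 1)
    candidates = None
    for i in probes:                                        # lines 2-8
        sim_0i = cosine(new_ratings, ratings[i])            # line 3
        values, users = sim_lists[i]
        lo = bisect.bisect_left(values, sim_0i - EPS)       # line 4: binary search
        hi = bisect.bisect_right(values, sim_0i + EPS)
        set_i = set(users[lo:hi])
        if abs(sim_0i - 1.0) <= EPS:                        # lines 5-7: probe may be the twin
            set_i.add(i)
        # line 9, computed incrementally
        candidates = set_i if candidates is None else candidates & set_i
    for x in sorted(candidates):                            # lines 10-15
        if ratings[x] == new_ratings:                       # Relationship (2)
            values, users = sim_lists[x]
            return list(zip(values, users))                 # copy the twin's list
    return None


# Toy data (hypothetical): user 2 is the twin of the new user.
ratings = [[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [0, 2, 4, 4]]
sim_lists = build_sorted_lists(ratings)
found = twin_search(ratings, sim_lists, [1, 1, 0, 5], c=2, rng=random.Random(0))
missing = twin_search(ratings, sim_lists, [2, 2, 2, 2], c=2, rng=random.Random(0))
```

In this sketch the intersection is accumulated inside the probe loop rather than in a separate step; the resulting candidate set is the same.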

3.2 Time Complexity Analysis 

We select the $c$ random users in line 1 in $O(c)$ time. The for loop in lines
2-8 contributes $O(c(m + \log n))$ to the running time if we use binary search
in line 4. Computing the intersection $Set_0$ in line 9 takes $O(cn)$ time. The
for loop in lines 10-15 requires $O(|Set_0| m)$ time if we use a linked list as
the data structure of the similarity matrix $S$. Therefore, the total running
time of Algorithm 1 is $O(|Set_0| m + c(m + \log n))$.

Now we focus on the value of $|Set_0|$. Because $Set_0 = \bigcap_{i=1}^{c}
Set_i$, we have $|Set_0| \le \min\{|Set_i|\}$, i.e., the largest possible value
of $|Set_0|$ is $\min\{|Set_i|\}$, $i \in [1, c]$. As the similarity values in
a specific $Set_i$ are all equal, $Set_i$ must be contained in one sub-list of
the original similarity list, where the sub-lists are produced by partitioning
the similarity list by similarity value. For example, suppose that we have $x$
sub-lists; then the similarity values in the sub-lists lie in the ranges
$[0, \frac{1}{x}), [\frac{1}{x}, \frac{2}{x}), \ldots, [1 - \frac{1}{x}, 1.0]$,
respectively. Thus, $|Set_0|$ is bounded above by the size of the largest
sub-list.
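As a small illustration of this bound, the sketch below partitions a similarity list into $x$ equal-width sub-lists and checks that a set of users sharing one similarity value cannot be larger than the largest sub-list. The similarity values and the choice $x = 10$ are made up for the example, and the list is assumed to be non-empty.

```python
from collections import Counter


def sublist_index(similarity, x):
    """Index j of the sub-list [j/x, (j+1)/x) holding the value.

    The value 1.0 is placed in the last sub-list, [1 - 1/x, 1.0].
    """
    return min(int(similarity * x), x - 1)


def largest_sublist_size(similarities, x):
    """Size of the most populated sub-list (similarities must be non-empty)."""
    counts = Counter(sublist_index(s, x) for s in similarities)
    return max(counts.values())


# Hypothetical similarity list of one user.
sims = [0.05, 0.12, 0.12, 0.12, 0.31, 0.33, 0.90, 1.0]
x = 10

# A Set_i: all users whose similarity equals one particular value.
set_i = [s for s in sims if s == 0.12]
bound = largest_sublist_size(sims, x)
```

Because every member of `set_i` has the same similarity value, all of them fall into the same sub-list, so `len(set_i)` cannot exceed `bound`.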

Moreover, Wei et al. showed that any user’s similarity list obeys a specific
Gaussian distribution in recommender systems. In this paper, because of the
value range of similarity, we set the sample range to $[0, 1.0]$. Since for any
Gaussian distribution more than 99.99% of the samples lie in the range
$[\mu - 4\sigma, \mu + 4\sigma]$, we fix the similarity value range $[0, 1.0]$
within $\mu \pm 4\sigma$ in this paper. Figure 1 shows the basic statistical
settings of one similarity list, where the similarity value range of the
largest sub-list is $[\mu - k_3\sigma, \mu + k_4\sigma]$. Therefore, the size
of the sub-list with the largest number of users is

$$
s = \frac{\text{Area under the Gaussian distribution curve between } \mu - k_3\sigma \text{ and } \mu + k_4\sigma}{\text{Area under the Gaussian distribution curve between } 0 \text{ and } 1.0} \times n.
$$

Writing $0 = \mu - k_1\sigma$ and $1.0 = \mu + k_2\sigma$, and using the
properties of the Gaussian distribution, we rewrite the expression of $s$ as:

$$
s = \frac{\Phi(k_3) + \Phi(k_4) - 1}{\Phi(k_1) + \Phi(k_2) - 1} \times n, \tag{3}
$$

where $\Phi(x)$ is the cumulative distribution function of the standard
Gaussian distribution. Our goal is to find the maximum value of $s$.
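The step from the ratio of areas to Equation (3) relies on the symmetry of the standard Gaussian distribution, $\Phi(-z) = 1 - \Phi(z)$. A short derivation is sketched below. It assumes, as in the constraints of the optimisation problem (4), that $0 = \mu - k_1\sigma$ and $1 = \mu + k_2\sigma$; the random variable $X \sim \mathcal{N}(\mu, \sigma^2)$ is notation introduced here for a similarity value.

```latex
\begin{align*}
P(\mu - k_3\sigma \le X \le \mu + k_4\sigma)
  &= \Phi(k_4) - \Phi(-k_3)
   = \Phi(k_4) - \bigl(1 - \Phi(k_3)\bigr)
   = \Phi(k_3) + \Phi(k_4) - 1, \\
P(0 \le X \le 1)
  &= P(\mu - k_1\sigma \le X \le \mu + k_2\sigma)
   = \Phi(k_1) + \Phi(k_2) - 1.
\end{align*}
```

Dividing the first probability by the second and multiplying by $n$ gives the expression for $s$.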



Fig. 1: Distribution of user’s similarity list 


In fact, for a specific Gaussian distribution and a partition parameter $x$,
the area under the Gaussian distribution curve between $\mu - k_3\sigma$ and
$\mu + k_4\sigma$ is fixed. However, when $k_1 = k_3$, the area under the
Gaussian distribution curve between 0 and 1.0 reaches its minimum value. Thus,
when $k_1 = k_3$, the value of $s$ is maximal. We then have the following
linear programming problem:

$$
\begin{aligned}
\text{maximise} \quad & s = \frac{\Phi(k_3) + \Phi(k_4) - 1}{\Phi(k_1) + \Phi(k_2) - 1} \times n \\
\text{subject to} \quad & \mu - k_1\sigma = 0 \\
& \mu + k_2\sigma = 1 \\
& \mu - k_3\sigma = 0 \\
& \mu + k_4\sigma = \frac{1}{x} \\
& 0 \le k_1 \le 4, \; 0 \le k_2 \le 4 \\
& 0 \le k_3, \; 0 \le k_4.
\end{aligned}
\tag{4}
$$

According to the properties of the cumulative distribution function of the
standard Gaussian distribution, the solution of the linear programming problem
(4) is $k_1 = 0$, $k_2 = 4$, $k_3 = 0$, $k_4 = 0.01$. We then have the maximum
$s \approx \frac{1}{125} n$, which is the upper bound of $|Set_0|$. In this
paper, we assume $c \ll n$. Therefore, the overall running time of the
TwinSearch Algorithm 1 is $O(\frac{1}{125} mn)$, which is much less than the
running time ($O(mn)$) of the traditional similarity computation method. In
this paper, we assume that $k$ identical new users will be created in the
system, so the total running time to build the similarity lists of the $k$
users with the traditional similarity computation method is $O(kmn)$, while
with the TwinSearch algorithm it is O((1 + ...
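The size of the bound can be checked numerically. The sketch below evaluates the ratio from Equation (3) at the stated solution $k_1 = 0$, $k_2 = 4$, $k_3 = 0$, $k_4 = 0.01$, using the identity $\Phi(z) = \frac{1}{2}\bigl(1 + \operatorname{erf}(z/\sqrt{2})\bigr)$ and Python's `math.erf`. The function names are illustrative.

```python
import math


def std_normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sublist_fraction(k1, k2, k3, k4):
    """Fraction s / n from Equation (3): share of users in the largest sub-list."""
    numerator = std_normal_cdf(k3) + std_normal_cdf(k4) - 1.0
    denominator = std_normal_cdf(k1) + std_normal_cdf(k2) - 1.0
    return numerator / denominator


# Solution of the optimisation problem as stated in the text.
fraction = sublist_fraction(k1=0.0, k2=4.0, k3=0.0, k4=0.01)
```

The fraction evaluates to roughly 0.008, i.e., the largest sub-list holds about $n/125$ users, so the verification loop in lines 10-15 of Algorithm 1 touches only a small share of the users in the worst case.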


4 Experimental Evaluation 

In this section, we use real-world datasets to evaluate the running time of the
TwinSearch Algorithm and of the traditional similarity computation method. We
begin with a description of the datasets and then perform a comparative
analysis of our algorithm and the traditional similarity computation.


4.1 Datasets and Experimental Settings 

In the experiments, we use two real-world datasets: the MovieLens dataset^1 and
the Douban^2 (one of the largest rating websites in China) film dataset^3. The
MovieLens dataset consists of 100,000 ratings (1-5 integral stars) from 943
users on 1682 films, where each user has rated at least 20 films and each film
has been rated by 20-250 users. The Douban film dataset contains 16,830,839
ratings (1-5 integral stars) from 129,490 unique users on 58,541 unique films.
All the experiments are implemented in MATLAB 8.5 (64-bit) on a PC with an
Intel Core2 Quad Q8400 processor (2.67 GHz) and 8 GB of DDR2 RAM.

4.2 Experimental Results 

We design 4 experiments (Figures 2 to 5) to evaluate the running time for $k$
new users with the same ratings on the above two datasets in both user-based
and item-based CF. We use the cosine similarity metric as the traditional
similarity computation method and set $k = 30$ in the 4 experiments. The 4
figures show that the TwinSearch algorithm achieves a much better running time
than the traditional similarity computation method.

^1 http://www.grouplens.org/datasets/movielens/
^2 http://www.douban.com
^3 https://www.cse.cuhk.edu.hk/irwin.king/pub/data/douban




Fig. 2: Running time of User-based CF on MovieLens (x-axis: Number of Users)

Fig. 3: Running time of User-based CF on Douban film (x-axis: Number of Users)

Fig. 4: Running time of Item-based CF on MovieLens

Fig. 5: Running time of Item-based CF on Douban film


5 Conclusion 

Neighbourhood-based Collaborative Filtering (CF) plays an important role in
e-commerce because of its easy implementation and high recommendation accuracy.
Two classic problems, the cold-start problem and the scalability problem,
challenge the task of dynamically maintaining the similarity list in
neighbourhood-based CF. Recently, several methods have been proposed to solve
these two problems. However, these methods apply a traditional $O(n^2)$
algorithm to compute the similarity list in a special case: the new users, with
enough recommendation data, have the same rating list. To address the large
computational cost of this special case, we design a faster
($O(\frac{1}{125} n^2)$) algorithm to build new users’ similarity lists, which
avoids repeatedly computing and sorting the similarity list and thereby saves
computational resources. Both theoretical and experimental results show that
our algorithm achieves better running time than the traditional method.